Abstract: The Internet has become an important medium in applications such as education and
business, so the security of data communicated over it is necessary. A secure network is
maintained by an Intrusion Detection System (IDS), which observes the data traffic and
identifies it as normal or malicious. Most applications now depend on advanced network
technologies, namely wireless networks, wireless sensor networks and Bluetooth. In wireless
sensor networks, security mechanisms such as key-management protocols, authentication
techniques and security protocols cannot be used because of resource constraints, which makes
an IDS the ideal security mechanism for them. In this paper, classification and predictive
models for intrusion detection are built using machine learning classification algorithms.
Keywords: Threat Analysis; Information Security; Data security. 
1. Introduction 


An Intrusion Detection System (IDS) is a security mechanism used to monitor the network for
abnormal behavior. The IDS reports whether a user's activity is normal or not. To identify an
intrusion, it compares the user's activities with previously stored intrusion records.


Supervised machine learning techniques can build accurate predictive models for large data
sets, which is not possible with traditional methods [1-7]. As specified by Tom Mitchell,
machine learning based intrusion detection falls into two categories: anomaly and misuse.
A misuse-based IDS learns attack patterns from the training data, so it can detect only
known attacks; new attacks cannot be identified. An anomaly-based IDS observes the normal
behavior and treats any change in that behavior as an anomaly, so it can detect new attacks
that were not learned from the training data.


So far, machine learning techniques such as artificial neural networks, Support Vector
Machine and Naive Bayes have been proposed for intrusion detection. Hybrid detection
techniques that combine different techniques have also been proposed. The literature
comparing supervised machine learning techniques in intrusion detection is limited [8-19].
Hence this paper aims at understanding the implications of using supervised machine learning
techniques for intrusion detection.


Recently, authors used Random Forest, Naive Bayes, K-means and Support Vector Machine to
identify four types of attacks and also proposed a best-feature selection method. They
concluded that the Random Forest Classifier (RFC) outperforms the other methods and
mentioned that a hierarchical clustering method can be used to improve the performance.


Other authors proposed semi-supervised machine learning based intrusion detection, but did
not consider resource consumption. A combination of different classifiers to identify
intrusions has also been proposed; it used supervised classification or unsupervised
clustering to filter the data and was tested on the NSL-KDD data-set with a decision tree
classifier [20-29], but the method works only for binary classification. Another intrusion
detection system used supervised machine learning techniques to identify online network
data as normal or not; it identifies only Probe and Denial of Service attacks, and the other
attacks are not considered.


A framework based on a machine learning approach has been proposed in which intrusions are
identified by analyzing local features. Researchers also proposed a Naive Bayes based
multiclass classifier to identify intrusions and suggested that intrusion detection is
possible with the Hidden Naive Bayes (HNB) model.


Denial of Service attacks are identified with better accuracy than other attacks. Studies
suggested an intrusion detection technique using Support Vector Machine (SVM) together with
a feature removal method to improve efficiency. With this method the best nineteen features
were selected from the KDD-CUP99 data-set, but the data set used is very small [30-43]. A
lightweight IDS has also been proposed that focuses mainly on pre-processing the data so
that only important attributes are used. Its first step is to remove redundant data so that
the learning algorithms give an unbiased result.


A survey on intrusion detection systems has been conducted that discusses IDS
classification, intrusion types, computing location and infrastructure. It discusses IDSs
for Mobile Ad hoc Networks (MANET) and compares them with IDSs for Wireless Sensor Networks
(WSN). The authors suggested that distributed and cooperative IDS schemes are suitable for
mobile applications, centralized IDSs for stationary applications and hierarchical IDSs for
cluster-based applications.


Authors proposed an intrusion detection framework that uses a specification-based approach
to detect routing attacks. They claim that the method has a low False Positive Rate (FPR)
and a good intrusion detection rate, but it works only for static networks [44-56].
Investigators developed separate IDSs for the Sink, the Cluster Head (CH) and the Sensor Node
(SN) and combined them to identify intrusions in heterogeneous Cluster Based Wireless Sensor
Networks (CWSN), but the detection rate for U2R, R2L and Probe attacks is very low.
2. Machine Learning Techniques 


This work designs a network intrusion detection system with different supervised machine
learning classifiers and investigates the performance of Logistic Regression, Support Vector
Machine (SVM), Gaussian Naive Bayes (GNB) and Random Forest in intrusion detection. These
classifiers are discussed below.


Logistic regression - Logistic Regression (LR) is used to solve classification problems. LR
works for both binary and multiclass classification. The probability of occurrence of an
event is predicted by fitting the data to the logistic function, whose values lie in the
range 0 to 1. If the value is 0.5 or above, the instance is labeled 1, otherwise 0.
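The thresholding rule described above can be illustrated with a minimal sketch. The weights and bias below are hypothetical values chosen for illustration; in practice they would be obtained by fitting the model to training data.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps a real score z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 if the predicted probability is at least the threshold, else 0."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if logistic(z) >= threshold else 0

# Hypothetical weights for two features (not taken from the paper).
weights, bias = [0.8, -0.4], -0.1

print(logistic(0.0))                             # a score of 0 maps to probability 0.5
print(predict_label([1.0, 0.5], weights, bias))  # score 0.5, probability above 0.5
print(predict_label([0.0, 2.0], weights, bias))  # score -0.9, probability below 0.5
```

A score of zero sits exactly on the 0.5 boundary, so the sign of the linear score decides the label.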


Support vector machine - The Support Vector Machine algorithm is used mainly for
classification problems, but it can also be used for regression problems. Each data item is
plotted as a point in an N-dimensional feature space, with the value of each feature as a
coordinate. Classification is then made by finding the hyperplane that best separates the
two classes. Support vectors are the observations that lie closest to the boundary; this
subset of the training samples specifies the decision function. This paper uses the linear
kernel of SVM for intrusion detection.
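As an illustration of the decision function of a linear SVM, the sketch below classifies points by the side of a hyperplane on which they fall. The hyperplane coefficients and the points are made up for illustration; a trained SVM would choose the hyperplane that maximizes the margin, which this sketch does not do.

```python
def decision_value(x, w, b):
    """Signed score w.x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x, w, b):
    """Label 1 (attack) on the non-negative side of the hyperplane, 0 (normal) otherwise."""
    return 1 if decision_value(x, w, b) >= 0 else 0

def distance_to_hyperplane(x, w, b):
    """Geometric distance from x to the hyperplane w.x + b = 0."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return abs(decision_value(x, w, b)) / norm

# Hypothetical hyperplane x1 + x2 - 3 = 0 in a two-feature space.
w, b = [1.0, 1.0], -3.0
points = [[1.0, 1.0], [2.0, 1.5], [4.0, 3.0]]

labels = [classify(p, w, b) for p in points]
# The observation closest to the boundary plays the role of a support vector.
closest = min(points, key=lambda p: distance_to_hyperplane(p, w, b))
print(labels, closest)
```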


Gaussian naive Bayes - Gaussian Naive Bayes is a supervised learning method. A prediction is
based on the probabilities of each attribute belonging to each class. The algorithm assumes
that the probability of each attribute belonging to a given class value does not depend on
the other attributes. The probability of an attribute value given a class value is called a
conditional probability. The probability of a data instance is found by multiplying the
conditional probabilities of all its attributes together. A prediction is made by
calculating the probability of the instance for each class and selecting the class value
with the highest probability.
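The prediction rule described above can be sketched as follows. The sketch assumes that each attribute follows a normal distribution within a class, and it adds log-probabilities instead of multiplying probabilities, which is equivalent but numerically safer. The toy data and the function names are illustrative only.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate the prior of each class and the mean and variance of each attribute."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small constant avoids division by zero
        model[label] = (len(members) / len(rows), stats)
    return model

def log_gaussian(x, mean, var):
    """Logarithm of the normal density with the given mean and variance at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, row):
    """Select the class with the highest log prior plus summed log-likelihoods."""
    scores = {}
    for label, (prior, stats) in model.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, mean, var) for x, (mean, var) in zip(row, stats)
        )
    return max(scores, key=scores.get)

# Toy two-attribute data, made up for illustration.
rows = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [5.0, 8.0], [5.5, 7.5], [4.8, 8.3]]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]
model = fit_gaussian_nb(rows, labels)

print(predict(model, [1.1, 2.1]))
print(predict(model, [5.2, 7.9]))
```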


Random forest classifier - Breiman proposed the random forest machine learning classifier in
2001. It is an ensemble method that works on the basis of proximity search and uses decision
trees as its base classifiers. It makes use of the standard divide and conquer approach to
improve the performance. The main principle behind random forest is that a group of weak
learners together forms a strong learner. It is applicable to disjunctive hypotheses.
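The principle that a group of weak learners forms a strong learner can be sketched with a simplified ensemble. The sketch assumes one-level decision trees (stumps) that split a randomly chosen feature at its mean, each trained on a bootstrap sample and combined by majority vote. It is a simplification for illustration and not Breiman's full algorithm, which grows deeper trees and samples features at every split.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature):
    """Weak learner: split one feature at its mean and label each side by majority."""
    threshold = sum(row[feature] for row in rows) / len(rows)
    left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
    fallback = majority(labels)
    return (feature, threshold,
            majority(left) if left else fallback,
            majority(right) if right else fallback)

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # sampling with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample], feature))
    return forest

def predict_forest(forest, row):
    """Strong learner: majority vote over the predictions of all weak learners."""
    return majority([predict_stump(stump, row) for stump in forest])

# Toy two-attribute data, made up for illustration.
rows = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [5.0, 8.0], [5.5, 7.5], [4.8, 8.3]]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]
forest = fit_forest(rows, labels)

print(predict_forest(forest, [1.1, 2.1]))
print(predict_forest(forest, [5.2, 7.9]))
```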
2.1. Intrusion Detection 


The standard intrusion detection data set KDDCUP99 [23] has redundant records, which may
bias the results of the machine learning algorithms. The supervised machine learning
algorithms are therefore tested on the NSL-KDD [24] data-set, an improved version of the
KDDCUP99 intrusion detection data-set. It has 42 features and four simulated attacks.
Denial of Service (DoS) attack: Overuse of the bandwidth or non-availability of the system
resources leads to DoS attacks. Examples: Neptune, Teardrop and Smurf.
Fig.1. Existing Methodology (Source: Internet)


User to Root (U2R) attack: The attacker initially accesses a normal user account and later
gains root access by exploiting vulnerabilities of the system. Examples: Perl, Load Module
and Eject attacks.
Probe attack: The attacker gathers information about the entire network before launching an
attack. Examples: ipsweep and nmap attacks.
Remote to Local (R2L) attack: The attacker gains local access by sending packets to a remote
machine and exploiting vulnerabilities of the network. Examples: imap, guess password and
ftp-write attacks.
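The grouping of individual attacks into the four categories can be expressed as a lookup table. The sketch below covers only the examples named in the text; the exact label spellings (for example "guess_passwd" and "ftp_write") are assumptions about how the labels appear in the data files, and the function names are illustrative.

```python
# Attack examples from the text; the label spellings are assumed, not verified.
ATTACK_CATEGORY = {
    "neptune": "DoS", "teardrop": "DoS", "smurf": "DoS",
    "perl": "U2R", "loadmodule": "U2R", "eject": "U2R",
    "ipsweep": "Probe", "nmap": "Probe",
    "imap": "R2L", "guess_passwd": "R2L", "ftp_write": "R2L",
}

def to_category(label):
    """Map a raw record label to 'normal', one of the four categories, or 'unknown'."""
    label = label.strip().lower()
    if label == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(label, "unknown")

def to_binary(label):
    """Collapse the label to 0 for normal traffic and 1 for anything else."""
    return 0 if to_category(label) == "normal" else 1

print(to_category("Smurf"), to_category("nmap"), to_category("normal"))
print(to_binary("normal"), to_binary("teardrop"))
```

Labels that are missing from the table map to "unknown" rather than to a guessed category, so gaps in the table stay visible.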


In the pre-processing step, all categorical data in textual form is converted to numerical
form. The pre-processed data is divided into training data and testing data. Models are built
using the Logistic Regression, Gaussian Naive Bayes, Support Vector Machine and Random
Forest classifiers and are used to predict the labels of the test data. The actual and
predicted labels are compared, and Accuracy, True Positive Rate (TPR) and False Positive
Rate (FPR) are computed. The performance of the models is compared on these parameters. The
following steps are used to build the models.
• Pre-process the data set.
• Divide the data set into training data and testing data.
• Build the classifier model on the training data for
    ◦ Logistic Regression
    ◦ Support Vector Machine
    ◦ Gaussian Naive Bayes
    ◦ Random Forest
• Read the test data.
• Test the classifier models on the test data.
• Compute and compare TPR, FPR, Precision, Recall, F1-Score and Accuracy for all the models.
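The pre-processing, splitting and evaluation steps above can be sketched in plain Python. The function names, the toy records and the stand-in predictions are illustrative assumptions; the model-building step is omitted, and any classifier that produces 0/1 labels could supply the predictions.

```python
import random

def encode_categorical(rows, columns):
    """Replace the text values in the given column indexes with integer codes."""
    encoded = [list(row) for row in rows]
    for c in columns:
        # Sort the distinct values so that the codes are reproducible.
        codes = {value: i for i, value in enumerate(sorted({row[c] for row in rows}))}
        for row in encoded:
            row[c] = codes[row[c]]
    return encoded

def split_train_test(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle the record indexes and hold out a fraction of them for testing."""
    indexes = list(range(len(rows)))
    random.Random(seed).shuffle(indexes)
    n_test = int(len(rows) * test_fraction)
    test, train = indexes[:n_test], indexes[n_test:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

def binary_metrics(actual, predicted):
    """Compare actual and predicted labels (1 = attack, 0 = normal)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # TPR is the same quantity as recall
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {"TPR": tpr, "FPR": fpr, "Precision": precision, "Recall": tpr,
            "F1": f1, "Accuracy": (tp + tn) / len(pairs)}

# Made-up records with the columns (duration, protocol, service).
records = [[0, "tcp", "http"], [2, "udp", "dns"], [1, "tcp", "ftp"], [0, "icmp", "ecr_i"]]
encoded = encode_categorical(records, columns=[1, 2])
train_x, train_y, test_x, test_y = split_train_test(encoded, [0, 0, 1, 1])

# Made-up actual and predicted labels standing in for the output of a classifier.
scores = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
print(encoded)
print(scores)
```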


The supervised machine learning algorithms Logistic Regression, Gaussian Naive Bayes,
Support Vector Machine and Random Forest are tested on the NSL-KDD data-set, the new
standard intrusion detection data-set. The experiments are run on an Intel Core(TM)
i5-3230M CPU @ 2.60 GHz with 4 GB RAM, and the code is written in Python. The result of the
experiment is represented as a reliability curve, in which the estimated probabilities are
plotted against the true empirical probabilities. A reliability curve is plotted for each of
the above-mentioned classifiers. The reliability curve of an ideal classifier falls near the
diagonal because the estimated and empirical probabilities are nearly equal.


The estimated probability values are grouped into ranges of 0 to 0.1, 0.1 to 0.2 and so on.
Values from 0 to 0.1 belong to bin I, values from 0.1 to 0.2 belong to bin II, and similarly
for the other ranges. It can be concluded that the Random Forest classifier outperforms the
other methods in identifying network traffic as normal or an attack, whereas SVM identifies
intrusions with the lowest probability estimate.
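The binning behind a reliability curve can be sketched as follows, assuming ten equal-width bins and binary labels (1 = attack). For each non-empty bin, the mean estimated probability is paired with the empirical fraction of attacks. The probabilities and labels below are made up for illustration.

```python
def reliability_curve(probabilities, actual, n_bins=10):
    """Return (mean estimated probability, empirical positive fraction) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, actual):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[index].append((p, y))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            positive_fraction = sum(y for _, y in members) / len(members)
            curve.append((mean_p, positive_fraction))
    return curve

# Made-up estimated probabilities and true labels.
probabilities = [0.05, 0.08, 0.15, 0.52, 0.55, 0.85, 0.92, 0.95]
actual = [0, 0, 0, 0, 1, 1, 1, 1]

curve = reliability_curve(probabilities, actual)
for mean_p, positive_fraction in curve:
    print(round(mean_p, 3), positive_fraction)
```

A well-calibrated classifier yields points close to the diagonal, where the two coordinates of each pair are nearly equal.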


The quality of the classification models is assessed by plotting the Receiver Operating
Characteristic (ROC) curve. Random Forest has the highest TPR, so its ROC curve is plotted
separately. From the graphs it can be concluded that the Random Forest classifier has the
lowest FPR and the highest TPR in identifying attacks and outperforms the other techniques,
whereas Support Vector Machine has the highest FPR (39%) and the lowest TPR (75%) for
intrusion detection.
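The points of an ROC curve can be obtained by sweeping a decision threshold over the scores of a classifier and recording the FPR and TPR at each threshold. The sketch below assumes binary labels with at least one example of each class; the scores and labels are made up for illustration and are not the results reported here.

```python
def roc_points(scores, actual):
    """Return (FPR, TPR) pairs obtained by thresholding at each distinct score."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and y == 1 for s, y in zip(scores, actual))
        fp = sum(s >= threshold and y == 0 for s, y in zip(scores, actual))
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Made-up classifier scores and true labels (1 = attack, 0 = normal).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
actual = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, actual)
print(points)
print(area_under_curve(points))
```

A classifier with a low FPR and a high TPR produces a curve that rises quickly towards the top-left corner, which corresponds to a larger area under the curve.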


This is because too many features from the data set are considered [15] and the linear
kernel function of SVM is used. The results show that the Random Forest classifier, with the
highest accuracy, outperforms the other methods, whereas SVM has the lowest accuracy.
Logistic Regression has better accuracy than Gaussian Naive Bayes and SVM.
3. Conclusions 


With the rapid development of real-time big data and the Internet of Things, the 
demand for surrounding environmental data has also increased significantly. The demand for 
WSN products with low node costs and easy deployment will gradually expand. WSN 
products can break through traditional detection methods. They reduce the costs of 
environmental testing, and also greatly reduce the cumbersome process of traditional testing 
methods. As a new network, the WSN has been widely studied by scientific researchers and 
widely used in industry since its inception. Common applications include scenarios such as 
environmental detection, military operations, and information positioning. An attempt has
been made to compare the performance of the supervised machine learning classifiers Support
Vector Machine, Random Forest, Logistic Regression and Gaussian Naive Bayes for intrusion
detection.